Speech Processing Culture


Since audio is only one-dimensional data, and speech can be captured at low sampling rates (mostly 8 kHz and 16 kHz, following the Nyquist theorem) and low bit resolution (8 bit or 16 bit, often quantized on a perceptual scale such as a-law or mu-law; a small companding sketch follows the list below), speech processing is the oldest of the research fields presented here. This has several consequences:
a) Speech processing has the most advanced machine learning techniques.
b) This field can computationally afford the most advanced statistical methods.
c) Scientific progress is generally slower (“in Automatic Speech Recognition, 0.5% improvement is a PhD thesis”).
d) Speech processing has the most advanced benchmarking and testing culture.
e) The majority of the approaches do not seek to work online (i.e. in real time and incrementally as new data comes in).
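
As a concrete illustration of the perceptual quantization mentioned above, here is a minimal sketch of mu-law companding in Python. The function names are my own, mu = 255 is the value used in 8-bit telephone speech, and a real codec would additionally quantize the companded values to 8-bit integers.

```python
import numpy as np

MU = 255.0  # standard value for 8-bit mu-law telephony

def mu_law_compress(x, mu=MU):
    """Map linear samples in [-1, 1] through the mu-law companding curve."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=MU):
    """Invert the companding curve back to linear amplitude."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

# Quiet samples get a disproportionately large share of the output range:
x = np.array([0.01, 0.1, 0.5, 1.0])
print(mu_law_compress(x))  # approx. [0.23, 0.59, 0.88, 1.00]
```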

The vocabulary in Speech Processing is pretty much dominated by its main subfield: Automatic Speech Recognition, or ASR. Speech recognition is sometimes wrongly called “voice recognition”. Only recently have other fields emerged, such as: Speaker Recognition (given a sample of a speaker, is the given audio recording actually from the same speaker?), Speaker Identification (given an audio recording and a database of speaker models: “Who is in the recording?”), Speaker Diarization (given no prior information: “Who (speaker a, b, c, ...) spoke when?”), Language and Dialect Identification (given a database of sample speech in different languages and dialects: “Which language is spoken in the test sample?”), Spoken Language Retrieval (given a database of speech, different questions are targeted: “Where does speaker x say something?”, “When is sentence xyz uttered?”), and objective speech and audio recording quality measures (PEAQ, PESQ). Speech synthesis is a very important field too. However, since it mostly involves no semantic analysis, it won't be covered by this document (at the moment).

Although audio can consist of various things other than speech, such as music and noise, speech processing is still the dominant field. The discrimination of speech and non-speech (speech activity detection) is a research field of its own. Noise classification deals with classifying different non-speech signals. Music processing is an entirely new field that aims to improve the tasks musicians do, such as composing music, creating music (by playing an instrument), recording music, editing music, and playing back recordings. Music retrieval (searching music in large databases) has recently become a heavily researched field.

Speech processing usually relies on probabilistic methods. Feature extraction has become relatively unimportant; usually a set of standard features is used, such as MFCC, PLP, RASTA, or LPCC, which were developed in the early days of signal processing. Recently, prosodic features have caught the attention of scientists. Prosodic features are largely independent of what people have said and thus seem to be an indication of who is speaking or of his or her emotional state. Prosodic features include: pitch, higher formants, Long-Term Average Spectrum (LTAS), Harmonics-to-Noise Ratio, and speaking rate (syllables per second).

Machine learning techniques used for various speech tasks include: Gaussian Mixture Models (GMMs), Neural Networks, also called Multi-Layer Perceptrons (NNs or MLPs), Support Vector Machines (SVMs), and Hidden Markov Models (HMMs). It seems that many tasks are approached with an MFCC/GMM/HMM approach on the first try: MFCC features are extracted from the audio tracks, GMMs are trained using expectation maximization, and HMMs are used to model the time dependencies between the frame-based GMM classifications.

A very important aspect of the speech community is that any result must be benchmarked carefully using a publicly available dataset, such as those from NIST or the Linguistic Data Consortium (LDC); papers won't be accepted otherwise. Error measures depend on the task: ASR uses Word Error Rate (WER), speaker recognition uses Equal Error Rate (EER), and diarization uses Diarization Error Rate (DER). Other measures, such as precision/recall and F-score, DET and ROC curves, confusion matrices, or simple false alarm/false positive percentages, are used as well. One can say that a speech researcher who does not know his or her current score on a public benchmark won't be taken seriously among colleagues.
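
As an illustration of that first-try recipe, here is a minimal sketch assuming librosa for MFCC extraction and hmmlearn for the GMM-HMM; the file name, model sizes, and single-utterance training are placeholder assumptions, not a tuned recognizer.

```python
import librosa
from hmmlearn import hmm

# 1. Feature extraction: 13 MFCCs per frame (librosa defaults are close enough for a sketch).
audio, sr = librosa.load("utterance.wav", sr=16000)       # placeholder file name
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13).T  # shape: (n_frames, 13)

# 2. Acoustic modelling: a GMM-HMM trained with expectation maximization.
#    A real recognizer trains one model per phone (or triphone) on labelled data;
#    a single small model is fitted here only to show the moving parts.
model = hmm.GMMHMM(n_components=3, n_mix=4, covariance_type="diag", n_iter=20)
model.fit(mfcc)

# 3. The HMM models the time dependencies between the frame-based GMM scores:
#    decoding returns the most likely hidden-state sequence for the frames.
log_likelihood, states = model.decode(mfcc)
print(log_likelihood, states[:10])
```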
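
The error measures are equally concrete. WER, for instance, is the word-level edit distance (substitutions + deletions + insertions) between the recognized text and the reference transcript, divided by the number of reference words; a minimal implementation:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance, computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the quick brown fox", "the quick brown box jumps"))  # 0.5
```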

An example: Automatic Speech Recognition

Since ASR is the most important field in Speech Processing, here is an example of how an automatic speech recognizer works.
Speech recognition engines are usually very big systems. A state-of-the-art speech recognition engine contains the work of many PhD dissertations and is not a single-person effort. As a consequence, complete speech recognizers barely exist in universities; universities usually deal only with certain aspects of the task, while complete engines are the domain of companies and research institutes. A natural consequence is that a speech recognition engine is divided into many small pieces that are glued together according to the task. Of course, there are training modules and test modules, where "test" refers to the actual recognition modules.

The big picture is as follows:
- Feature Extraction: The sampled audio data (wave file) is transformed into a different representation that allows easier analysis. Mostly MFCCs are used.
- Speech Activity Detection: The first thing to do after feature extraction is to get rid of any non-speech there might be, for example coughs, laughter, door slams, or music. This is usually done using a trained approach (a much simpler, energy-based sketch follows this list).
- Feature Normalization: After the features have been extracted and the non-speech has been removed, one tries to make the features invariant to anything but the spoken words. Ideally, we want to eliminate any statistical dependency on the speaker or the channel (microphone, room reverberation). Many techniques exist to normalize features: some are very basic, like Gaussianization (sketched after this list), and some are pretty advanced, like Vocal Tract Length Normalization (VTLN).
- Recognition: Now that we have audio that hopefully contains only speech, and features that are invariant to everything but the actual spoken words, we use GMMs/MLPs/SVMs to compare the spoken audio on different levels (basically, using different window lengths) to the recorded and annotated words in our acoustic models. So we try to recognize "a" by comparing it to all instances of "a" stored in our acoustic model; it is considered to be an "a" if it is very close to the other instances of "a" and not so close to any other acoustic element, such as "e" or "o". Using different window lengths, one can compare on the sub-phoneme, phoneme, syllable, and word level.
- Decoder: Once small-scale recognition (e.g. of phonemes or syllables) is performed, we have to glue the pieces together. This is done using the language model. For each acoustic instance the recognizer usually outputs a set of alternatives with probabilities, so we have to make sure that individual phone combinations exist, that syllables fit together to form words, and that words actually form grammatically correct sentences. This is the hardest part: the problem is to choose the most likely combination of phonemes, syllables, and words according to the recognizer output (a toy decoder is sketched after this list). A very hard problem is handling words that are not part of the language model.
- Textual Postprocessing: Once decoding is done, the output probably looks like this: "andhesaidthequickbrownfoxjumpedoverthelazydognamedbruno". What one has to do now is take prosody, speech pauses, and other hints into account to detect sentence boundaries so that punctuation can be added. Named entities should also be detected so that capitalization works (a minimal, pause-based sketch follows this list). Hopefully the output then looks like this: And he said: "The quick brown fox jumped over the lazy dog named Bruno".
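
To make the steps above more concrete, here are a few minimal Python sketches. First, speech activity detection: real systems use a trained approach as described above, so the crude energy threshold below (with an arbitrary frame size and threshold) only shows where such a module sits in the pipeline.

```python
import numpy as np

def energy_vad(audio, frame_len=400, threshold_db=-35.0):
    """Crude speech/non-speech decision per 25 ms frame (16 kHz audio) from energy alone.
    Real systems use trained classifiers; this is only an illustration."""
    n_frames = len(audio) // frame_len
    decisions = []
    for i in range(n_frames):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        energy_db = 10 * np.log10(np.mean(frame ** 2) + 1e-12)
        decisions.append(energy_db > threshold_db)
    return np.array(decisions)  # True = keep the frame as speech
```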
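
Next, feature normalization: the "very basic" Gaussianization mentioned above can be sketched as a rank-based warping of each feature dimension towards a standard normal distribution over the whole recording (real systems often apply it over sliding windows; VTLN is far more involved and not shown).

```python
import numpy as np
from scipy.stats import norm, rankdata

def gaussianize(features):
    """Per-dimension rank-based Gaussianization ("feature warping"):
    each feature dimension is passed through its empirical CDF and then
    through the inverse standard-normal CDF, so its distribution over the
    recording becomes approximately N(0, 1)."""
    n_frames, n_dims = features.shape
    warped = np.empty_like(features, dtype=float)
    for d in range(n_dims):
        ranks = rankdata(features[:, d])  # 1 .. n_frames
        cdf = (ranks - 0.5) / n_frames    # empirical CDF, strictly inside (0, 1)
        warped[:, d] = norm.ppf(cdf)      # inverse Gaussian CDF
    return warped
```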
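
For the decoder, the job of choosing the most likely combination of the recognizer's alternatives can be illustrated with a toy Viterbi search over word hypotheses rescored by a bigram language model. All words, scores, and probabilities below are invented for the example.

```python
# Acoustic alternatives per position: (word, acoustic log-probability). Invented numbers.
alternatives = [
    [("the", -0.2), ("a", -1.5)],
    [("quick", -0.4), ("thick", -0.5)],
    [("brown", -0.3), ("frown", -1.0)],
    [("fox", -0.6), ("box", -0.4)],
]

# Toy bigram language model: log P(word | previous word). Invented numbers.
bigram = {
    ("<s>", "the"): -0.1, ("<s>", "a"): -0.7,
    ("the", "quick"): -0.3, ("the", "thick"): -2.0,
    ("quick", "brown"): -0.2, ("quick", "frown"): -3.0,
    ("brown", "fox"): -0.2, ("brown", "box"): -2.5,
}
UNSEEN = -6.0  # crude back-off score for bigrams not in the model

def decode(alternatives, bigram, lm_weight=1.0):
    """Viterbi search: best word sequence under combined acoustic + language-model scores."""
    best = {"<s>": (0.0, [])}  # best[word] = (total log score, sequence ending in word)
    for hypotheses in alternatives:
        new_best = {}
        for word, acoustic in hypotheses:
            for prev, (score, seq) in best.items():
                total = score + acoustic + lm_weight * bigram.get((prev, word), UNSEEN)
                if word not in new_best or total > new_best[word][0]:
                    new_best[word] = (total, seq + [word])
        best = new_best
    return max(best.values())[1]

print(decode(alternatives, bigram))  # ['the', 'quick', 'brown', 'fox']
```

A real decoder searches far larger hypothesis spaces (lattices of phonemes, syllables, and words), but the principle of combining acoustic scores with language-model scores is the same.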
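
Finally, textual postprocessing: the sketch below segments and capitalizes purely from pause lengths (real systems also use prosody, language models, and named-entity detection as described above; the pause values and the threshold are invented).

```python
def add_punctuation(words, pause_after, pause_threshold=0.5):
    """Insert a period and start a new sentence after pauses >= pause_threshold seconds.
    Only pause length is used here; real systems combine many more cues."""
    out, capitalize_next = [], True
    for word, pause in zip(words, pause_after):
        w = word.capitalize() if capitalize_next else word
        capitalize_next = False
        if pause >= pause_threshold:
            w += "."
            capitalize_next = True
        out.append(w)
    if not out[-1].endswith("."):
        out[-1] += "."
    return " ".join(out)

words = ["and", "he", "said", "the", "quick", "brown", "fox", "jumped"]
pauses = [0.05, 0.05, 0.80, 0.05, 0.05, 0.05, 0.05, 1.20]
print(add_punctuation(words, pauses))  # "And he said. The quick brown fox jumped."
```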
